Configuring Databricks

Databricks is a unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions. Among its other features, Databricks is used for ETL and for managing security, governance, and data discovery.

The Lazsa Platform uses Databricks for operations such as data integration, data transformation, and data quality. After you save the connection details for Databricks, you can use the connection in any node of a data pipeline.

Prerequisites

The following Amazon S3 permissions are required for configuring Databricks:

  • s3:ListBucket

  • s3:PutObject

  • s3:GetObject

  • s3:DeleteObject

  • s3:PutObjectAcl

To access an Amazon S3 bucket from Databricks, you must create an instance profile with read, write, and delete permissions. For detailed instructions on how to create an instance profile, see: instance-profile-tutorial.html
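If you script this setup, the following is a minimal sketch of an inline IAM policy granting the permissions listed above. The bucket name (my-lazsa-bucket), role name (databricks-instance-role), and policy name are hypothetical placeholders; adjust them to your environment.

  # Minimal sketch (not the Lazsa Platform's own tooling): attach an S3 access
  # policy to the IAM role behind your Databricks instance profile.
  # The bucket and role names below are hypothetical placeholders.
  import json
  import boto3

  BUCKET = "my-lazsa-bucket"
  ROLE = "databricks-instance-role"

  policy = {
      "Version": "2012-10-17",
      "Statement": [
          {
              # ListBucket applies to the bucket itself.
              "Effect": "Allow",
              "Action": ["s3:ListBucket"],
              "Resource": f"arn:aws:s3:::{BUCKET}",
          },
          {
              # Object-level actions apply to the objects in the bucket.
              "Effect": "Allow",
              "Action": [
                  "s3:PutObject",
                  "s3:GetObject",
                  "s3:DeleteObject",
                  "s3:PutObjectAcl",
              ],
              "Resource": f"arn:aws:s3:::{BUCKET}/*",
          },
      ],
  }

  boto3.client("iam").put_role_policy(
      RoleName=ROLE,
      PolicyName="lazsa-databricks-s3-access",
      PolicyDocument=json.dumps(policy),
  )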

To configure the connection details of your Databricks instance, do the following:

  1. Sign in to the Lazsa Platform and click Configuration in the left navigation pane.
  2. On the Platform Setup screen, on the Cloud Platform, Tools & Technologies tile, click Configure.
  3. On the Cloud Platform, Tools & Technologies screen, in the Data Integration section, click Configure. (After you save your first connection details in this section, you see the Modify button here.)
  4. On the Databricks screen, do the following:
    1. In the Details section, provide the following details:

      Name - Give a unique name to your Databricks configuration. This name is used to save and identify your specific Databricks connection details within the Lazsa Platform.
      Description - Provide a brief description that helps you identify the purpose or context of this Databricks configuration.
    2. In the Configuration section, provide the following information:

      Databricks URL - Provide the URL for the configured Databricks instance.

      Select Cloud Provider - Select the cloud provider on which the Databricks cluster is deployed.

      Note: You must select the same cloud service provider on which the Databricks cluster is deployed. For example, if Databricks is deployed on Azure, do not select AWS as the cloud service provider.

    3. Depending on the cloud provider on which the Databricks cluster is deployed and how you want to retrieve the credentials to connect to Databricks, do one of the following:

      Connect using Lazsa Orchestrator Agent - Enable this option to resolve your Databricks credentials within your private network via the Lazsa Orchestrator Agent, without sharing them with the Lazsa Platform.

      • If the cloud service provider is AWS, provide the following details:

        • Select the Lazsa Orchestrator Agent for AWS.

        • Secret Name - provide the name with which the secret is stored in AWS Secrets Manager.

      • If the cloud service provider is Azure, provide the following details:

        • Select the Lazsa Orchestrator Agent for Azure.

        • Vault Name - provide the name of the Azure Key Vault where the secret is stored.

      Select Secret Manager - If you do not connect through the Lazsa Orchestrator Agent, select where the credentials are stored:

      • Select Lazsa to store the user credentials securely in the Lazsa-managed secrets store.

        • Databricks API Token Key - provide the key for the Databricks API token.

      • Select AWS Secrets Manager.

        • In the Secret Management Tool dropdown list, the AWS Secrets Manager configurations that you save and activate in the Secret Management section on the Cloud Platform, Tools & Technologies screen are listed for selection. Select the configuration of your choice.

        • Secret Name - provide the name with which the secrets for Databricks are stored (see the sketch after this step).

      Databricks API Token - Provide the Databricks API token.
      Provision user access using Azure AD Group - Enable this option to provision user access by using an Azure AD group.
      Test Connection - Click Test Connection to verify the provided details and ensure that the Lazsa Platform can successfully connect to the configured Databricks instance.
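      For the AWS Secrets Manager option above, the following is a minimal sketch (not the Lazsa Platform's own logic) of how such a secret might be created and then checked against the Databricks REST API, which is roughly what Test Connection verifies. The secret name (lazsa/databricks), the key (databricks-api-token), and the workspace URL are hypothetical placeholders.

      # Hedged sketch: store a Databricks API token in AWS Secrets Manager and
      # verify it against the workspace. All names below are placeholders.
      import json
      import boto3
      import requests

      WORKSPACE_URL = "https://abcd-teste2-test-spcse2.cloud.databricks.com"

      # Store the token under a named secret (the "Secret Name" field above).
      sm = boto3.client("secretsmanager")
      sm.create_secret(
          Name="lazsa/databricks",
          SecretString=json.dumps({"databricks-api-token": "<your-token>"}),
      )

      # Read the token back and call a lightweight Databricks REST endpoint.
      secret = json.loads(
          sm.get_secret_value(SecretId="lazsa/databricks")["SecretString"]
      )
      resp = requests.get(
          f"{WORKSPACE_URL}/api/2.0/clusters/list",
          headers={"Authorization": f"Bearer {secret['databricks-api-token']}"},
      )
      resp.raise_for_status()  # a 200 response means the URL and token are valid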
    4. In the Databricks Resource Configuration section, provide the following information:

      Organization ID - Provide the organization ID. When you log in to your Databricks workspace, observe the URL in the address bar of the browser. The number that follows o= in the URL is the organization ID. For example, if the URL is https://abcd-teste2-test-spcse2.cloud.databricks.com/?o=2281745829657864#, the organization ID is 2281745829657864.

      Cluster Name - Do one of the following: select an existing cluster, or create a new one (see Create or Edit Cluster Details below).

      Workspace Parent Folder - Provide the name of the parent folder in the Databricks workspace that you are configuring.

      Auto Clone Custom Code to Databricks Repos - Enable this option if you want your custom code to be automatically cloned to Repos in Databricks. If you disable this option, Lazsa uploads the custom code to Repos in Databricks.

      Secret Scope - A secret scope is a collection of secrets stored in an encrypted database owned and managed by Databricks. It allows Databricks to use the credentials stored in it locally, eliminating the need to connect to an external secrets management tool every time a job is run in a data pipeline. You can either use an existing secret scope from the ones already created in Databricks or create a new one. Do one of the following:

      • Select a secret scope from the dropdown list.

      • Click + New Secret Scope to create a new secret scope. Provide a name for the new secret scope and click Create. (A sketch of the equivalent REST API call follows this list.)
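      For reference, a secret scope can also be created outside the UI through the Databricks Secrets API. The following is a minimal sketch, assuming a hypothetical scope name (lazsa-scope) and placeholder workspace URL and token.

      # Hedged sketch: create a Databricks-backed secret scope via the REST API,
      # equivalent to clicking + New Secret Scope. Names are placeholders.
      import requests

      WORKSPACE_URL = "https://abcd-teste2-test-spcse2.cloud.databricks.com"
      TOKEN = "<your-databricks-api-token>"

      resp = requests.post(
          f"{WORKSPACE_URL}/api/2.0/secrets/scopes/create",
          headers={"Authorization": f"Bearer {TOKEN}"},
          json={"scope": "lazsa-scope", "initial_manage_principal": "users"},
      )
      resp.raise_for_status()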

    5. Secure configuration details with a password
      To password-protect your Databricks connection details, turn on this toggle, enter a password, and then retype it to confirm. This step is optional but recommended. When you share the connection details with multiple users, password protection helps ensure that only authorized users can access them.
    6. Click Save Configuration. You can see the configuration listed on the Data Integration screen.

       

    Create or Edit Cluster Details

    To create a new cluster, provide the following details:
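    The details involved map closely to what the Databricks Clusters API expects. The following is a rough sketch with placeholder values for the cluster name, Spark runtime version, node type, and worker count; the actual fields in the Lazsa dialog may differ.

    # Hedged sketch: create a cluster through the Databricks Clusters REST API.
    # All values below are hypothetical placeholders.
    import requests

    WORKSPACE_URL = "https://abcd-teste2-test-spcse2.cloud.databricks.com"
    TOKEN = "<your-databricks-api-token>"

    resp = requests.post(
        f"{WORKSPACE_URL}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "cluster_name": "lazsa-pipeline-cluster",
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
            "autotermination_minutes": 60,
        },
    )
    resp.raise_for_status()
    print(resp.json()["cluster_id"])  # ID of the newly created cluster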

 


What's next? Cloud Platforms, Tools and Technologies